Analysis of JAX Envision™ Data using envisionR

Author

Michael C. Saul, michael.saul [at] jax.org

Introduction

The JAX Envision™ System

JAX Envision™ is JAX’s cloud-based software for digital in vivo studies. This system is useful for high-throughput phenotyping of various behavioral and physiological measures. It pairs with the Allentown Discovery™ IVC system for 24/7 real-time home cage monitoring of mice.

Purpose of this Vignette

The purpose of this vignette is to introduce methods for the analysis of experimental data from the JAX Envision system using the R Project for Statistical Computing. R is a programming language frequently used by biologists and data scientists for statistical analysis and visualization of biological data.

Some familiarity with R would be helpful to understand these examples more fully. If you need additional help in learning and using R, see the Appendix on Learning Statistics and Using R for some resources.

How to Use this Vignette

The code blocks in this vignette are designed to generalize to multiple applications. It is hoped that with little modification, the code in this vignette may be adapted by others for their own work.

Introducing the Dataset

The dataset utilized here was collected on a pre-release version of the Allentown rack system in the JAX Envision suite. It tracks 8 cages of mice treated with either caffeine at 16 mg/kg or vehicle.

This vignette primarily analyzes activity-related digital measures. Activity is typically computed as a velocity of mice in cm/s and is useful for a wide variety of disciplines. Other digital measures will be available in the future. It is anticipated that this vignette will generalize to any measure, but discretion should be used prior to making this assumption.

Notes on R Coding Conventions

We will use tidyverse, janitor, and here, all available from CRAN and installable using install.packages(), for data import. Note that we will make frequent use of tidy data and the pipe operator |> in R. Pipes and tidy data manipulation facilitate rapid transformation and analysis of data frame type objects in R.

Additionally, all assignment operators for global variables used here are <- and all functions from packages outside of base R utilize the {package}::{function}() format to ensure that you know which R package contains each function used in this vignette. This includes functions from envisionR.

Notes on Replicability and Reproducibility

This vignette contains only the code needed to analyze JAX Envision™ data. It is likely that replicability and reproducibility will require additional best practices. This may include:

  • Making literate coding analysis notebooks with R Markdown, Quarto, or Jupyter.
  • Containerizing compute environments using Docker, Singularity, or KubeFlow.
  • Version control with git and GitHub.
  • Careful use of the here package and relative file paths to ensure analysis file paths are preserved when data and analysis scripts are moved.
  • Potential use of the workflowR package to combine many of these best practices.

Data Wrangling and Quality Control

The first step in analysis is to export the data from the JAX Envision app. This can be accomplished using the Export functions from a given study, accessible at “Analyze > Export Metrics”. The analyses outlined here assume that you have downloaded cage mean metrics at 1 hour time resolution.

R Libraries for Data Import

We need a few libraries for data import, wrangling, and QC:

  • tidyverse includes tools for gracefully importing, manipulating, and plotting data (includes dplyr, tibble, and ggplot2).
  • janitor has functions for cleaning datasets and can automatically rename column titles in snake case.
  • here lets us use relative file paths.

We also bring in envisionR.

# R code

# Importing libraries
library(tidyverse) # also attaches ggplot2, dplyr, and tibble
library(janitor)
library(here)
library(envisionR)

Generating JAX Envision URLs

It is often useful to create a URL to visualize the specific moments of video highlighted in an analysis. Such videos may allow the end user to look at exemplary or aberrant moments in a study. These may be used to illustrate activities of interest. We can also use the process of making a URL to look at how time is encoded in R, as we will work with time frequently here.

Envision URLs are composed of a set of components that reference a specific organization, study, cage, and time.

The time stamps used in JAX Envision refer to UNIX timestamps, but with the units in milliseconds instead of seconds. This format facilitates accessing a precise frame of data in the stream. The function defined below takes a set of parameters defining organization, study, cage, time, and video start time. It uses these to generate URLs of interest. Please note that the cage number is the URL cage number, not the cage name in the app.
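
As an illustration of this millisecond encoding, a POSIXct time can be converted to a JAX Envision-style timestamp with base R arithmetic. This is a minimal sketch of the encoding itself, not an envisionR function:

```r
# Converting a POSIXct time to a millisecond UNIX timestamp
t <- as.POSIXct("2024-06-17 09:00:00", tz = "US/Central")
t_ms <- as.numeric(t) * 1000 # UNIX seconds -> milliseconds

# Recovering the POSIXct time from the millisecond timestamp
t_back <- as.POSIXct(t_ms / 1000, tz = "US/Central", origin = "1970-01-01")
```

The value of t_ms here (1718632800000) matches the videoStart component of the URL generated below.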

We can use the as.POSIXct function to make a time that we’re interested in. In this case, we wish to look at an example of a time in the mouse safety dataset just after the initial dose of a large amount of caffeine.

# R code

# Getting a time as a POSIXct object
# NOTE: time zone must be correct to refer to the right moment
example_vidstart <- as.POSIXct("2024-06-17 09:00:00",
                               tz = "US/Central")

# Using a function to generate an Envision URL
make_envision_url(org = 9,
                  study = 237,
                  cage = 1823,
                  vidstart = example_vidstart)
[1] "https://envision.jax.org/org/9/study/237/cage/1823?metricsTab=cage&rangeEnd=1718676000000&rangeStart=1718589600000&videoStart=1718632800000&videoStream=overlay"

We can copy-paste this into our address bar and go to this moment in time. It is also possible to use this method to insert links using this URL into literate code documents such as RMarkdown or Quarto documents.

Setup for Data Import

Before importing the data, a good practice is to generate a metadata object with information you will use over and over again in the analysis. For example, the study title, time zone, genotypes, and lights-on and lights-off times from the study are frequently needed throughout the analysis. By defining the metadata before even bringing in the data themselves, you may save yourself from some potential headaches in the future.

The time zone of a study is typically a time zone identifier. Setting this time zone identifier in the metadata allows the analyst to avoid any future ambiguities in time.
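
Valid identifiers come from the IANA (Olson) time zone database, which base R can list directly; for example, we can confirm that the identifier used in this vignette exists:

```r
# Listing valid time zone identifiers from the Olson/IANA database
tz_names <- OlsonNames()

# Confirming the identifier used in this vignette is present
# (TRUE on systems with the standard tzdata installed)
"US/Central" %in% tz_names
```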

The study’s app URL contains some important information. Each study has its own org ID and study ID. These IDs are what allow us to reference a specific study, as in the URL-generation example above.

We use a function called envision_metadata(). This performs some checks of our input, does some data type conversion under the hood, and acts as a convenient source of study information in downstream steps.

# R code

# Getting study metadata
metadata <- envisionR::envision_metadata(study_name = "Caffeine vs. Vehicle Study",
                                         tzone = "US/Central", # The time zone of the study
                                         lights_on = "06:00:00", # When lights-on happens in the study
                                         lights_off = "18:00:00", # When lights-off happens in the study
                                         study_url = "https://envision.jax.org/org/9/study/237/")

Another good practice is to set a random seed at the beginning of a script. This allows any analysis based upon randomization to produce precisely the same values over and over again. The date when the analyst began writing the script is often a good random seed.

# R code

# Setting random seed
set.seed(20250117)
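
To see why this matters, note that re-setting the same seed reproduces the same pseudorandom draws. A minimal illustration with sample():

```r
# The same seed yields identical pseudorandom results
set.seed(20250117)
draw_a <- sample(1:100, 5)

set.seed(20250117)
draw_b <- sample(1:100, 5)

identical(draw_a, draw_b) # TRUE
```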

Importing Study Annotation Data

Study annotation data are used to make notes of important events and information for videos. The function envisionR::read_annotation_csv() brings these files into R as a tibble data type. It attempts to infer a time zone and sets all column data types. To export the study annotation data, we use “Analyze > Export Metrics > Export Annotations (CSV)” in the JAX Envision app.

To use envisionR::read_annotation_csv(), we run the code below:

# R code

# Getting the annotation path
annot_path <- system.file("extdata", "annotation.csv", package = "envisionR")

# Opening annotation file
annotation <- envisionR::read_annotation_csv(annot_path,
                                             metadata = metadata)

Note that read_annotation_csv() warns us that it is making an assumption about the time zone. Most of the time, these assumptions should work. However, you can set the time zone explicitly. For a list of valid time zones, use envisionR::list_tzones().

We can use the glimpse() function to look at this dataset as it was brought into R.

# R code

# Glimpsing annotation dataset
dplyr::glimpse(annotation)
Rows: 48
Columns: 18
$ id                   <dbl> 23827, 23828, 23829, 23830, 23831, 23832, 23833, …
$ created              <dttm> 2024-06-10 09:09:16, 2024-06-10 09:10:21, 2024-0…
$ created_date_local   <date> 2024-06-10, 2024-06-10, 2024-06-10, 2024-06-10, …
$ created_time_local   <time> 09:09:16, 09:10:21, 09:11:18, 09:12:50, 09:13:27…
$ pin_start_date_local <date> 2024-06-10, 2024-06-10, 2024-06-10, 2024-06-10, …
$ pin_start_time_local <time> 06:06:22, 06:06:31, 06:06:37, 06:07:05, 06:07:10…
$ pin_end_date_local   <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ pin_end_time_local   <time> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ study_code           <chr> "GEN3Bridge", "GEN3Bridge", "GEN3Bridge", "GEN3Br…
$ group_name           <chr> "Vehicle", "Vehicle", "Caffeine 16 mpk", "Caffein…
$ cage_name            <chr> "A1", "B1", "C1", "A2", "C2", "A3", "B3", "C3", "…
$ creator              <chr> "Marie Curie", "Marie Curie", "Marie Curie", "Mar…
$ contents             <chr> "Cage removed for ear tag placement", "cage remov…
$ reply_to             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ hashtags             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ pin_start_time       <dttm> 2024-06-10 06:06:22, 2024-06-10 06:06:31, 2024-0…
$ pin_end_time         <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ tzone                <chr> "US/Central", "US/Central", "US/Central", "US/Cen…

A note about the created, pin_start_time, and pin_end_time variables: these are date-times, which are currently in the assumed local time zone for the study. They can be coerced to a different time zone such as UTC using the command as.POSIXct(time, tz = "newtimezone") where time is your time and "newtimezone" is the name of your new time zone. The underlying time will remain the same, but the displayed time will be appropriate for a different time zone.
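
For example, a minimal base R sketch of this coercion (lubridate::with_tz() offers the same behavior in tidyverse workflows):

```r
# The same instant, displayed in the study's local time zone and in UTC
t_local <- as.POSIXct("2024-06-10 06:06:22", tz = "US/Central")
t_utc <- as.POSIXct(t_local, tz = "UTC")

format(t_local) # "2024-06-10 06:06:22"
format(t_utc)   # "2024-06-10 11:06:22"

# The underlying instant is unchanged
t_local == t_utc # TRUE
```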

Grabbing Annotation Information

We note that in this case, not all the grouping variables we are interested in were saved in the group_name field. Instead, the grouping information exists entirely in the annotation contents field.

As the contents field of the annotation dataset is free-form text and not a controlled vocabulary, we cannot depend upon it to faithfully reflect our dataset without some work. Instead, we can mine these data for the grouping information using regular expressions. We can look at the contents of the annotation file prior to making a dosing sheet. We know the following things about this annotation dataset:

  • All dose information contains the word “Dose” or “Dosing” somewhere. Capitalization might not be consistent.
  • We are not interested in replies to others, which start with the character “@”.
  • All vehicle doses have the word “Vehicle” in them. Capitalization might not be consistent.
  • All caffeine doses have the word “Caffeine” in them. Capitalization might not be consistent.
  • All cage insertions have the word “insert” in them.
  • All 16 mg/kg doses have the number 16 in them.

From this information, we can use a string of regular expressions to extract the dosing information. We will make a data frame to hold these grouping data.

# R code

# Extracting dose data
annotation_group <- annotation |>
  dplyr::filter(grepl("[Dd]os[ei]", contents) & 
                !grepl("^@", contents) &
                grepl("insert", contents)) |>
  dplyr::mutate(drug = ifelse(grepl("[Vv]ehicle", contents), "Vehicle", ""),
                drug = ifelse(grepl("[Cc]affeine", contents), "Caffeine", drug),
                dose = ifelse(drug == "Vehicle", 0, -100),
                dose = ifelse(grepl("16", contents), 16, dose)) 

Next, we QC the drug information.

# R code

# Making a table with the drug and dose information
annotation_group |>
  dplyr::group_by(drug, dose) |>
  dplyr::summarize(n = length(contents))
`summarise()` has grouped output by 'drug'. You can override using the
`.groups` argument.

The grouping variables have been added. Next, we create a data frame that only includes one entry per cage.

# R code

annotation_group <- annotation_group |>
  dplyr::group_by(cage_name) |>
  dplyr::summarize(drug = unique(drug),
                   dose = unique(dose),
                   drug_dose_mgperkg = paste(stringr::str_to_title(drug), " (", dose, 
                                             " mg/kg)", sep = ""),
                   dose1_time = pin_start_time[which(pin_start_time < lubridate::ymd_hms("2024-06-18 18:00:00"))],
                   dose2_time = pin_start_time[which(pin_start_time > lubridate::ymd_hms("2024-06-18 18:00:00"))])

Now we check our work.

# R code

annotation_group

The annotation data look good and are ready for downstream work. Now that we have wrangled the data, one thing we can do to help ourselves in the future is to serialize the annotation data. This saves the data to disk in .RDS format, which simplifies bringing them back into R. In this case, we create a directory to hold serialized data, then serialize the annotation data.

# R code

# Making a folder to hold serialized R objects
# NOTE: tempdir() points to a per-session temporary directory; for persistent
# storage, use a project-relative path (e.g., with here::here())
outdir <- tempdir()

# Serializing annotation
saveRDS(annotation, paste0(outdir, "/annotation.RDS"))
saveRDS(annotation_group, paste0(outdir, "/annotation_group.RDS"))

We have finished wrangling and QCing the annotation data.
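
Serialized objects can be restored in a later session with readRDS(). A minimal round-trip sketch with a toy data frame (the path mirrors the outdir pattern used above):

```r
# Round-tripping an object through .RDS serialization
outdir <- tempdir()
toy <- data.frame(cage_name = c("A1", "B1"), dose = c(0, 16))
saveRDS(toy, paste0(outdir, "/toy.RDS"))

# Restoring the object; it is identical to the original
toy_back <- readRDS(paste0(outdir, "/toy.RDS"))
identical(toy, toy_back) # TRUE
```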

Importing Cage Demographics Data

The cage demographics data may be exported from the JAX Envision app using “Manage > Overview > … > Export Animals and Cages”.

# R code

# Getting the demographic data path
demo_path <- system.file("extdata", "cagedata.csv", package = "envisionR")

# Importing cage data
demodata <- envisionR::read_demographics_csv(demo_path)
Warning: The following named parsers don't match the column names: Murine ear
tag

We will now use the glimpse() function to see these data.

# R code

# Glimpsing cage data
dplyr::glimpse(demodata)
Rows: 24
Columns: 19
$ group_id          <chr> "Caffeine 16 mpk", "Caffeine 16 mpk", "Caffeine 16 m…
$ cage_id           <chr> "B3", "B3", "B3", "A3", "A3", "A3", "A2", "A2", "A2"…
$ animal_id         <chr> "D3B24", "D3B23", "D3B22", "D3B21", "D3B20", "D3B19"…
$ envision_ear_tag  <chr> "S4", "S3", "S2", "S4", "S3", "S2", "S4", "S3", "S2"…
$ strain            <chr> "CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD…
$ coat_color        <chr> "white", "white", "white", "white", "white", "white"…
$ genotype          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ additional_detail <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex               <chr> "male", "male", "male", "male", "male", "male", "mal…
$ birth_date        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ death_date        <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ear_notch         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ metal_ear_tag     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rap_id_tag_code   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rap_id_tag_color  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rfid              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ tail_tattoo       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ cage_dbid         <dbl> 1826, 1826, 1826, 1825, 1825, 1825, 1824, 1824, 1824…

We can see that the cagedata data frame contains some important experimental information from the study.

  • Group, encoded as group_id, is a user-defined field that can include any grouping variable of interest to the experiment.
  • Strain and genotype, encoded in strain and genotype, are frequently used fields that may be important to your specific analysis.
  • RapID eartag color, encoded as rap_id_tag_color, may be used to translate between animal numbers in the study and visually identifiable animals in a specific cage.
  • Sex, encoded in sex, is often a variable of interest for analysis.
  • Birth dates from each animal appear in the birth_date column. These can be used to compute animal ages for longitudinal studies. See the section on Computing Experimental Time Encoding from the Appendix: Tips and Tricks with Time.
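
As a sketch of the age computation (using a hypothetical toy data frame, since birth_date is NA in this particular export), ages in days follow from Date arithmetic:

```r
# Hypothetical example: computing age in days from a birth_date column
toy_demo <- data.frame(animal_id = c("M1", "M2"),
                       birth_date = as.Date(c("2024-04-01", "2024-04-08")))
study_start <- as.Date("2024-06-05")

# Date subtraction yields a difftime in days; coerce to numeric
toy_demo$age_days <- as.numeric(study_start - toy_demo$birth_date)
toy_demo$age_days # 65 58
```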

The demodata data frame needs little QC and is ready for serialization.

# R code

# Serializing the cagedata data frame
saveRDS(demodata, paste0(outdir, "/demodata.RDS"))

We have finished wrangling and QCing the cage data.

Importing Activity Data

We are now ready to import the activity data aggregated to 1 hour intervals.

# R code

# Getting the activity data path
act_path <- system.file("extdata", "activity.csv", package = "envisionR")

# Importing the activity data and cleaning column names.
activity <- envisionR::read_activity_csv(act_path,
                                         metadata = metadata)

We can look at the dataset using dplyr::glimpse().

# R code

# Glimpsing the dataset again
dplyr::glimpse(activity)
Rows: 3,664
Columns: 14
$ start                                <dttm> 2024-06-05 12:00:00, 2024-06-05 …
$ start_date_local                     <date> 2024-06-05, 2024-06-05, 2024-06-…
$ start_time_local                     <time> 12:00:00, 13:00:00, 14:00:00, 15…
$ study_code                           <chr> "GEN3Bridge", "GEN3Bridge", "GEN3…
$ aggregation_seconds                  <dbl> 3600, 3600, 3600, 3600, 3600, 360…
$ group_name                           <chr> "Caffeine 16 mpk", "Caffeine 16 m…
$ cage_name                            <chr> "A2", "A2", "A2", "A2", "A2", "A2…
$ animals_cage_quantity                <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ light_cycle                          <chr> "Light", "Light", "Light", "Light…
$ movement_mean_per_cage_cm_s_hour     <dbl> 2.906378, 3.676216, 1.868613, 2.1…
$ wheel_occupancy_mean_per_cage_s_hour <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ food_occupancy_mean_per_cage_s_hour  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ water_occupancy_mean_per_cage_s_hour <dbl> 0.32580777, 0.11608541, 0.0410775…
$ tzone                                <chr> "US/Central", "US/Central", "US/C…

The variables from this export include the following:

Variable Name Variable Description
start The date and time at the start of the aggregation bin (in UTC).
start_date_local The start date (in the time zone in which the data were collected).
start_time_local The start time (in the time zone in which the data were collected).
study_code A unique code for the study.
aggregation_seconds The number of seconds aggregated to generate this dataset (3600 is 1 hour).
group_name A user-defined group name, often used to label experimental groups of interest.
cage_name The name of the cage that the data represent.
animals_cage_quantity The number of animals in the cage, sometimes called cage density or occupancy.
light_cycle Whether the data were collected in the light (Light) or dark (Dark) cycle.
movement_mean_per_cage_cm_s_hour Cage-level movement in cm/s.
wheel_occupancy_mean_per_cage_s_hour Amount of time spent on the wheel at the cage level.
food_occupancy_mean_per_cage_s_hour Amount of time spent in proximity to the food hopper at the cage level.
water_occupancy_mean_per_cage_s_hour Amount of time spent in proximity to the water bottles at the cage level.

NOTE: at the cage level, none of these variables has been normalized for cage occupancy. Unnormalized data may have issues when occupancy changes. In each dataset, it is necessary to evaluate whether cage occupancy might have an effect.

We now run a summary of the dataset.

# R code

# Summarizing activity
summary(activity)
     start                     start_date_local     start_time_local         
 Min.   :2024-06-05 12:00:00   Min.   :2024-06-05   Min.   :00:00:00.000000  
 1st Qu.:2024-06-10 06:00:00   1st Qu.:2024-06-10   1st Qu.:06:00:00.000000  
 Median :2024-06-15 00:30:00   Median :2024-06-15   Median :12:00:00.000000  
 Mean   :2024-06-15 00:30:00   Mean   :2024-06-14   Mean   :11:30:15.720524  
 3rd Qu.:2024-06-19 19:00:00   3rd Qu.:2024-06-19   3rd Qu.:17:00:00.000000  
 Max.   :2024-06-24 13:00:00   Max.   :2024-06-24   Max.   :23:00:00.000000  
                                                                             
  study_code        aggregation_seconds  group_name         cage_name        
 Length:3664        Min.   :3600        Length:3664        Length:3664       
 Class :character   1st Qu.:3600        Class :character   Class :character  
 Mode  :character   Median :3600        Mode  :character   Mode  :character  
                    Mean   :3600                                             
                    3rd Qu.:3600                                             
                    Max.   :3600                                             
                                                                             
 animals_cage_quantity light_cycle        movement_mean_per_cage_cm_s_hour
 Min.   :3             Length:3664        Min.   : 0.000                  
 1st Qu.:3             Class :character   1st Qu.: 2.842                  
 Median :3             Mode  :character   Median : 4.168                  
 Mean   :3                                Mean   : 4.221                  
 3rd Qu.:3                                3rd Qu.: 5.415                  
 Max.   :3                                Max.   :16.902                  
                                          NA's   :10                      
 wheel_occupancy_mean_per_cage_s_hour food_occupancy_mean_per_cage_s_hour
 Min.   :0                            Min.   :0                          
 1st Qu.:0                            1st Qu.:0                          
 Median :0                            Median :0                          
 Mean   :0                            Mean   :0                          
 3rd Qu.:0                            3rd Qu.:0                          
 Max.   :0                            Max.   :0                          
 NA's   :10                           NA's   :10                         
 water_occupancy_mean_per_cage_s_hour    tzone          
 Min.   :0.0000                       Length:3664       
 1st Qu.:0.1952                       Class :character  
 Median :0.4244                       Mode  :character  
 Mean   :0.4319                                         
 3rd Qu.:0.6298                                         
 Max.   :1.3334                                         
 NA's   :10                                             

To ensure a tidy dataset, we can add the group names into this data frame by using the annotation_group data frame generated above. We use dplyr::left_join() and join on the variable cage_name.

# R code

# Doing a left join of activity and annotation_group
activity <- activity |>
  dplyr::left_join(annotation_group, by = "cage_name")

Changing Velocity Units

We will be working with the variable movement_mean_per_cage_cm_s_hour, which expresses the activity for that particular hour in the average velocity of mice inside the cage as cm/s. However, what if we wish to do a units conversion?

We can change the units relatively simply with the function velocity_unit_convert(). For example, if we wish to know meters per hour (m/h), we can make a new calculated column with this information.

# R code

activity_mperh <- activity |>
  dplyr::mutate(movement_mean_per_cage_m_h_hour = velocity_unit_convert(movement_mean_per_cage_cm_s_hour,
                                                                        units_in = "cm/s",
                                                                        units_out = "m/h"))

dplyr::glimpse(activity_mperh)
Rows: 3,664
Columns: 20
$ start                                <dttm> 2024-06-05 12:00:00, 2024-06-05 …
$ start_date_local                     <date> 2024-06-05, 2024-06-05, 2024-06-…
$ start_time_local                     <time> 12:00:00, 13:00:00, 14:00:00, 15…
$ study_code                           <chr> "GEN3Bridge", "GEN3Bridge", "GEN3…
$ aggregation_seconds                  <dbl> 3600, 3600, 3600, 3600, 3600, 360…
$ group_name                           <chr> "Caffeine 16 mpk", "Caffeine 16 m…
$ cage_name                            <chr> "A2", "A2", "A2", "A2", "A2", "A2…
$ animals_cage_quantity                <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ light_cycle                          <chr> "Light", "Light", "Light", "Light…
$ movement_mean_per_cage_cm_s_hour     <dbl> 2.906378, 3.676216, 1.868613, 2.1…
$ wheel_occupancy_mean_per_cage_s_hour <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ food_occupancy_mean_per_cage_s_hour  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ water_occupancy_mean_per_cage_s_hour <dbl> 0.32580777, 0.11608541, 0.0410775…
$ tzone                                <chr> "US/Central", "US/Central", "US/C…
$ drug                                 <chr> "Caffeine", "Caffeine", "Caffeine…
$ dose                                 <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 1…
$ drug_dose_mgperkg                    <chr> "Caffeine (16 mg/kg)", "Caffeine …
$ dose1_time                           <dttm> 2024-06-17 06:34:33, 2024-06-17 …
$ dose2_time                           <dttm> 2024-06-20 17:06:20, 2024-06-20 …
$ movement_mean_per_cage_m_h_hour      <dbl> 104.62962, 132.34378, 67.27007, 7…

The new column movement_mean_per_cage_m_h_hour gives the average distance in meters that the animals traveled per hour during that time period. Note that for the hourly digests, the unit of time cancels out:

\(\displaystyle{{meters \over hour} \times hour = meters}\).

The number we get by converting to m/h is the distance that the animals in that cage travelled in the space of an hour.
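
The conversion factor itself is simple arithmetic: 1 cm/s = 0.01 m per (1/3600) h = 36 m/h. A quick check against the first row above (this helper is a sketch of the arithmetic, not the envisionR function):

```r
# Sketch of the cm/s -> m/h conversion arithmetic
cm_s_to_m_h <- function(x) x * 36

# First activity value from the glimpse above
cm_s_to_m_h(2.906378) # ~104.63, matching movement_mean_per_cage_m_h_hour
```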

Normalization by Cage Occupancy

It should be noted that cage occupancy is a user-defined feature of each dataset. It is recommended that cage occupancy is quality checked in the app for each dataset. The simplest way to evaluate whether occupancy is an issue is to look at the final cage-day of data from a dataset and see if the number of animals matches what is annotated in the app. If the number is less than what is annotated, the analyst may need to reannotate cage occupancy.

Here, we evaluate whether the cages have a consistent annotated occupancy.

# R code

# Making a table of cage occupancy
table(activity$animals_cage_quantity, activity$cage_name)
   
     A1  A2  A3  B1  B3  C1  C2  C3
  3 458 458 458 458 458 458 458 458

The annotated occupancy shown above is a constant 3 animals for every cage. However, after visually checking the data at the end of the study in the app, we see that three cages have actual occupancies of less than 3: C8, C12, and C17.

We now visualize whether there are apparent differences in activity (the variable called movement_mean_per_cage_cm_s_hour) that might be influenced by cage occupancy. This starts with a violin and box-and-whiskers plot using ggplot2.

# R code

# Visualizing whether there are occupancy issues.
lightdark_violin_plot(activity_data = activity,
                      metadata = metadata,
                      visualize_on = "minoccupancy",
                      yvar = "movement_mean_per_cage_cm_s_hour")
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_boxplot()`).

We see that occupancy influences the distributions of activity in the dataset. The cage-level activity metrics are based upon the total distance traveled by all mice in a cage. Consequently, the occupancy issue should be resolved before further analysis. In this dataset, the following differences from an occupancy of 3 occur:

  • For cage C8, occupancy drops to 2 at approx. 08:00 CDT on 2024-06-12 and drops to 1 at approx. 07:15 CDT on 2024-06-13.
  • For cage C12, only 1 mouse is present throughout the course of the experiment.
  • For cage C17, occupancy is 2 throughout the entire experiment.

Normalizing for cage occupancy may make the distributions match one another better than they currently do. We perform the normalization by dividing movement_mean_per_cage_cm_s_hour by animals_cage_quantity.

# R code

# Computing a per animal normalized metric
activity_cagenorm <- activity |>
  dplyr::mutate(movement_mean_per_animal_cm_s_hour = movement_mean_per_cage_cm_s_hour / animals_cage_quantity)

Now we visualize the data to see whether this normalization had the desired effect.

# R code

# Visualizing change in occupancy.
lightdark_violin_plot(activity_data = activity_cagenorm,
                      metadata = metadata,
                      visualize_on = "minoccupancy",
                      yvar = "movement_mean_per_animal_cm_s_hour")
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_boxplot()`).

The occupancy corrections resolve issues with the cages. Downstream functions can perform cage-level normalization on the fly, so we will use the raw activity dataset in these functions. The data are ready to be serialized.

# R code

# Serializing the cleaned activity data.
saveRDS(activity, paste0(outdir, "/activity.RDS"))

The data have been quality checked and are ready for initial visualization.

Initial Data Visualization and Analysis

Exploratory analysis of JAX Envision data frequently starts out with visualization of the dataset with respect to experimental factors of interest. This section walks through some of these initial visualizations.

On Figure Colors

In this experiment, we can represent different experimental conditions using different colors. A good color scheme to use is the Okabe-Ito color scheme, which has been optimized for effective use as a categorical color scheme that includes eight different colors distinguishable by people with color vision deficiency.
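
Base R (version 4.0 and later) also ships this palette, which offers a quick way to inspect the hex colors before mapping them to groups:

```r
# Inspecting the Okabe-Ito palette built into base R (R >= 4.0)
okabe_ito <- palette.colors(palette = "Okabe-Ito")
length(okabe_ito) # 9 colors, including black and gray
```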

Most plots we make will use an R package called ggplot2. This R package is a modular “grammar of graphics” plotting library that allows us to layer plots together as we like and to change graphical elements as we wish. Almost any plot produced in this vignette can be modified by the user as they choose.

The ggokabeito package brings the Okabe-Ito palette into ggplot2.

In this experiment, we use gray to represent vehicle and dark blue to represent 16 mg/kg of caffeine. We bring in the library and set an order for the nine colors:

# R code

# Getting R libraries
library(ggokabeito)
okabe_order <- c(5, 8, 6, 1, 3, 4, 2, 7, 9)

Violin and Box-and-Whiskers Plots

As was shown in the above section on normalization by cage occupancy, violin and box-and-whiskers plots are useful for visualizing large-scale effects on whole experiments. We can apply this method to the dataset by generating the same violin and box-and-whiskers plot as above, but with the colors changed to show the grouping variables.

# R code

lightdark_violin_plot(activity_data = activity_cagenorm,
                      metadata = metadata,
                      visualize_on = "group_name",
                      yvar = "movement_mean_per_animal_cm_s_hour") +
  scale_color_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_ydensity()`).
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_boxplot()`).

We can see that at this high level, there is no apparent effect of drug on activity. We see that there is some unevenness in light period baseline activity. This is most likely due to differences in occlusion by nesting material between cages.

One result of the differences in occlusion is differences between cages that manifest particularly during the light period. These differences can affect individual animal datasets and are one source of what we call the cage effect. Put simply, animals within each cage behave more similarly to each other than they do to animals in other cages.

One way to resolve this and other cage-level issues without much extra work is to analyze experiments at the cage level.

Spaghetti Plots with Day/Night Rectangles

We can make a type of plot called a spaghetti plot, which shows each cage's activity measurements over the course of experimental time. This plot conveys a lot of information at a glance and is frequently one of the first plots we make in an experiment.

envisionR has a template for a spaghetti plot that can be made from an activity dataset with a metadata object. By default, this template facets the dataset by its different levels in the group_name column. We note that this function can perform normalization for occupancy; in fact, it normalizes for occupancy by default. Thus, we can pass it the raw activity data and ask it to do an occupancy normalization.

# R code

spaghetti_plot(activity_data = activity, 
               metadata = metadata,
               yvar = "movement_mean_per_cage_cm_s_hour",
               occupancy_norm = TRUE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_line()`).

In the spaghetti plot on hour-aggregated data, we can see that some effects are visible at even the roughest and least processed level of the dataset.

We also see that there is one very noisy cage in the Vehicle (0 mg/kg) group. The noisy cage turns out to be cage C12, which is a singly housed animal. We can deal with this outlier cage in a couple of different ways. One way is to simply throw the cage out. Another way is to use statistical techniques that weight that cage lower than other cages.
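For example, the first approach could be sketched with a dplyr::filter() call on the activity object from above (the activity_no_c12 name is illustrative):

```r
# Hypothetical sketch: dropping the singly housed outlier cage C12
activity_no_c12 <- activity |>
  dplyr::filter(cage_name != "C12")
```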

Ribbon Plot: Weighted Mean and Standard Error

In an ideal experiment, all cages have the same density. This is often not possible in a digital in vivo study for a variety of reasons. To compare cages housing different numbers of animals, we often must normalize affected metrics by cage occupancy.

However, this technique still gives a cage with 1 animal the same weight as a cage with 3 animals. As an alternative to equal weighting of cages, we can weight each observation by the number of mice in the cage.
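To illustrate the difference with toy numbers (not drawn from this dataset), base R's weighted.mean() weights each cage by its occupancy:

```r
# Per-animal activity for three hypothetical cages
cage_activity <- c(2.0, 3.0, 4.0)
# Number of animals in each cage
cage_occupancy <- c(1, 3, 3)

mean(cage_activity)                               # unweighted: each cage counts equally
weighted.mean(cage_activity, w = cage_occupancy)  # occupancy-weighted
```

Here the occupancy-weighted mean pulls away from the singly housed cage's value.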

We can make a ribbon plot, which draws the mean as a line and the standard error of the mean as a lighter-colored, slightly transparent band around the line. This lets us see at a glance where the standard error ranges overlap. This can be done with weighted standard error using the envisionR function group_mean_sem_ribbon_plot(). As with the spaghetti_plot() function, this function accepts the metadata and can normalize by cage occupancy.

We can run this function with unweighted mean and standard error:

# R code

group_mean_sem_ribbon_plot(activity_data = activity,
                           metadata = metadata,
                           yvar = "movement_mean_per_cage_cm_s_hour",
                           occupancy_norm = TRUE,
                           occupancy_weight = FALSE) + 
  scale_color_okabe_ito(order = okabe_order) + 
  scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

We can do the same with weighted mean and standard error:

# R code

group_mean_sem_ribbon_plot(activity_data = activity,
                           metadata = metadata,
                           yvar = "movement_mean_per_cage_cm_s_hour",
                           occupancy_norm = TRUE,
                           occupancy_weight = TRUE) + 
  scale_color_okabe_ito(order = okabe_order) + 
  scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

These look fairly similar, but the weighted dataset typically reflects a slightly greater degree of uncertainty. The use of weighted or unweighted metrics is something that should be carefully considered.

Zooming In on Interesting Time Points

Both of these plotting functions allow us to zoom in on specific time points by passing a pair of times as the xlim argument. For the safety study, one interesting way of visualizing the data is to zoom into the 30 hours around the first dose.

# R code

# Getting x limits for first dose
xlimits_firstdose <- as.POSIXct(c("2024-06-17 00:00:00", "2024-06-18 06:00:00"),
                                tz = metadata[["tzone"]])

spaghetti_plot(activity_data = activity, 
               metadata = metadata,
               yvar = "movement_mean_per_cage_cm_s_hour",
               occupancy_norm = TRUE,
               xlim = xlimits_firstdose) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_line()`).

We can do the same with the weighted mean and SEM summary plot.

# R code

group_mean_sem_ribbon_plot(activity_data = activity,
                           metadata = metadata,
                           yvar = "movement_mean_per_cage_cm_s_hour",
                           occupancy_norm = TRUE,
                           occupancy_weight = TRUE,
                           xlim = xlimits_firstdose) + 
  scale_color_okabe_ito(order = okabe_order) + 
  scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

We can already see the differences in activity due to differences in treatment.

Understanding Circadianism and Extracting Trend and Cycle

Note: some of these analyses assume a 24-hour day. Transitions to daylight time violate this assumption. See the subsection on Daylight Time and 24-Hour Data in Appendix: Tips and Tricks with Time.

Short Studies: Circadian-Matched Baseline Subtraction

For many studies, it is appropriate to subtract a baseline day from the same hour on a test day. This method is relatively simple and can work well for short studies. envisionR has a function subtract_timematch() that implements this behavior for activity data.

# R code

# Subtracting time matched data from the day before study days and visualizing the results
activity_timematchsubtract <- subtract_timematch(activity_data = activity,
                                                 var = "movement_mean_per_cage_cm_s_hour",
                                                 occupancy_normalize = TRUE)

dplyr::glimpse(activity_timematchsubtract)
Rows: 3,664
Columns: 12
$ start                   <dttm> 2024-06-05 12:00:00, 2024-06-05 13:00:00, 202…
$ cage_name               <chr> "A2", "A2", "A2", "A2", "A2", "A2", "A2", "A2"…
$ group_name              <chr> "Caffeine 16 mpk", "Caffeine 16 mpk", "Caffein…
$ subtract_var            <chr> "movement_mean_per_cage_cm_s_hour", "movement_…
$ animals_cage_quantity   <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ occupancy_norm          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ raw                     <dbl> 0.9687928, 1.2254054, 0.6228711, 0.7059753, 1.…
$ hour                    <chr> "12:00:00", "13:00:00", "14:00:00", "15:00:00"…
$ leading_trailing_na     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ internal_na             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ impute                  <dbl> 0.9687928, 1.2254054, 0.6228711, 0.7059753, 1.…
$ time_matched_subtracted <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

We can plot the output as a spaghetti plot:

# R code

spaghetti_plot(activity_data = activity_timematchsubtract,
               metadata = metadata,
               yvar = "time_matched_subtracted",
               occupancy_norm = FALSE) +
    scale_color_okabe_ito(order = okabe_order) 
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 200 rows containing missing values or values outside the scale range
(`geom_line()`).

We can also plot the output as a ribbon plot:

# R code

group_mean_sem_ribbon_plot(activity_data = activity_timematchsubtract,
                           metadata = metadata,
                           yvar = "time_matched_subtracted",
                           occupancy_norm = FALSE) +
  scale_color_okabe_ito(order = okabe_order) +
  scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_line()`).

We can zoom in for these plots in the same way as we did before:

# R code

group_mean_sem_ribbon_plot(activity_data = activity_timematchsubtract,
                           metadata = metadata,
                           yvar = "time_matched_subtracted",
                           occupancy_norm = FALSE,
                           xlim = xlimits_firstdose) +
  scale_color_okabe_ito(order = okabe_order) +
  scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_line()`).

The Y axis of this plot style shows the difference in velocity relative to the same hours on the baseline day. It is in the same units as the original activity dataset.

Long Studies: Time Series Decomposition

For longer studies such as longitudinal studies that may last for weeks to months, a technique called time series decomposition may be helpful. This technique breaks raw data down into three components:

  • Trend: Moving averages of the data computed over a complete cycle; in circadian datasets, this is 24 hours.
  • Periodic (called Circadian in these datasets): The component of the dataset that varies reliably with position in the period.
  • Residual: What remains after subtracting the trend and periodic components from the raw data.

While time series decompositions can be done using both additive and multiplicative processes, we concentrate on an additive time series decomposition here.
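As an illustration of the idea on simulated data (this is not the envisionR implementation), base R's stats::decompose() performs an additive decomposition of an hourly series with a 24-hour cycle:

```r
# Simulated two weeks of hourly data: a 24-hour cycle plus a slow trend and noise
set.seed(1)
hours <- 1:(24 * 14)
sim_ts <- ts(sin(2 * pi * hours / 24) + hours / 200 + rnorm(length(hours), sd = 0.1),
             frequency = 24)

# Additive decomposition into trend, seasonal (periodic), and random (residual)
sim_decomposed <- decompose(sim_ts, type = "additive")
str(sim_decomposed, max.level = 1)
```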

The periodic (circadian) component is particularly useful because it can be combined with the raw dataset to create a detrended component: the raw dataset with the periodic component subtracted out.

For these data, we can perform a time series decomposition using the tsd_cagemetrics() function.

# R code

activity_tsd <- tsd_cagemetrics(activity_data = activity,
                                var = "movement_mean_per_cage_cm_s_hour",
                                occupancy_normalize = TRUE)
Warning in tsd_cagemetrics(activity_data = activity, var =
"movement_mean_per_cage_cm_s_hour", : internal NA values imputed in this
dataset
glimpse(activity_tsd)
Rows: 3,664
Columns: 12
$ start                 <dttm> 2024-06-05 12:00:00, 2024-06-05 13:00:00, 2024-…
$ cage_name             <chr> "A2", "A2", "A2", "A2", "A2", "A2", "A2", "A2", …
$ group_name            <chr> "Caffeine 16 mpk", "Caffeine 16 mpk", "Caffeine …
$ tsd_var               <chr> "movement_mean_per_cage_cm_s_hour", "movement_me…
$ animals_cage_quantity <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ occupancy_norm        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ raw                   <dbl> 0.9687928, 1.2254054, 0.6228711, 0.7059753, 1.63…
$ impute                <dbl> 0.9687928, 1.2254054, 0.6228711, 0.7059753, 1.63…
$ circadian             <dbl> -1.32690279, -1.07661177, -0.58299443, -0.371353…
$ trend                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ residual              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ detrended             <dbl> 2.2956956, 2.3020171, 1.2058655, 1.0773284, 1.66…

We can plot the circadian components with the spaghetti_plot() function.

# R code

spaghetti_plot(activity_data = activity_tsd,
               metadata = metadata,
               yvar = "circadian",
               occupancy_norm = FALSE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

We can also use the group_mean_sem_ribbon_plot() function to plot detrended data.

# R code

group_mean_sem_ribbon_plot(activity_data = activity_tsd,
                           metadata = metadata,
                           yvar = "detrended",
                           occupancy_norm = FALSE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggokabeito::scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

We can zoom into the first dose using this approach.

# R code

group_mean_sem_ribbon_plot(activity_data = activity_tsd,
                           metadata = metadata,
                           yvar = "detrended",
                           occupancy_norm = FALSE,
                           xlim = xlimits_firstdose) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggokabeito::scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

This allows us to examine the treatment effect around the first dose with the circadian component removed.

Working with High Temporal Resolution Data

Up until now, we have worked with data aggregated at the 1-hour time scale. However, there are many reasons an analyst may wish to use higher-resolution data. The JAX Envision app can export data at 10-minute and 1-minute resolutions.

NOTE: resolutions down to 1 second or less are available for advanced users through the JAX Envision data science suite, where direct access to the Amazon S3 buckets allows downloading the parquet files output by the algorithms that compute distances travelled in each cage.
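Those parquet files, once downloaded, can be read into R with the arrow package (available from CRAN); the file name below is purely illustrative:

```r
# Hypothetical sketch: reading a downloaded parquet file of per-cage distances
library(arrow)
cage_distances <- read_parquet("cage_distances.parquet")
```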

We can use the same process as above to import and plot 1 minute scale data.

# R code

# Getting the activity 1 minute data path
act1m_path <- system.file("extdata", "activity_1min.csv", package = "envisionR")

activity_1min <- read_activity_csv(act1m_path,
                                   metadata = metadata)

dplyr::glimpse(activity_1min)
Rows: 219,408
Columns: 14
$ start                               <dttm> 2024-06-05 11:58:00, 2024-06-05 1…
$ start_date_local                    <date> 2024-06-05, 2024-06-05, 2024-06-0…
$ start_time_local                    <time> 11:58:00, 11:59:00, 12:00:00, 12:…
$ study_code                          <chr> "GEN3Bridge", "GEN3Bridge", "GEN3B…
$ aggregation_seconds                 <dbl> 60, 60, 60, 60, 60, 60, 60, 60, 60…
$ group_name                          <chr> "Caffeine 16 mpk", "Caffeine 16 mp…
$ cage_name                           <chr> "A2", "A2", "A2", "A2", "A2", "A2"…
$ animals_cage_quantity               <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ light_cycle                         <chr> "Light", "Light", "Light", "Light"…
$ movement_mean_per_cage_cm_s_min     <dbl> 2.657795, 2.264237, 5.066498, 4.06…
$ wheel_occupancy_mean_per_cage_s_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ food_occupancy_mean_per_cage_s_min  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ water_occupancy_mean_per_cage_s_min <dbl> 0.717777778, 0.109444444, 0.405000…
$ tzone                               <chr> "US/Central", "US/Central", "US/Ce…

Note that this file is substantially bigger (approx. 220,000 rows and about 26 MB) and takes much longer to import into R than the hour-aggregated file we were working with above.

We can plot these data using the spaghetti plot function.

# R code

spaghetti_plot(activity_data = activity_1min,
               metadata = metadata,
               yvar = "movement_mean_per_cage_cm_s_min",
               occupancy_norm = TRUE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

We note that the 1-minute scale data have many small spikes and valleys relative to the 1-hour data. As time scales become more granular, noise becomes an increasingly prominent issue. We will make a ribbon plot to see whether the noise is affecting our inferences.

# R code

group_mean_sem_ribbon_plot(activity_data = activity_1min,
                           metadata = metadata,
                           yvar = "movement_mean_per_cage_cm_s_min",
                           occupancy_norm = TRUE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggokabeito::scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

We see that the minute-level noise has made even the summarized dataset look somewhat rough. How do we overcome this issue? One way of doing it is to smooth the data.

There are a number of techniques available to smooth data, but one of the simplest is to run a moving average (sometimes also called a rolling mean) on the dataset. envisionR has a moving average function called moving_average(). We select a relatively small window of 15 minutes for the moving average so that we do not lose too much granular detail. As above, we use the dplyr::mutate() function to insert a new calculated column into the dataset. However, we add another couple of tricks, which are to arrange the dataset in order (think the Sort function in Microsoft Excel) and to group the dataset into smaller pieces so that we can run the moving average on each cage individually. We use the pipe operator, |>, to feed the output of each line into the next lines, which lets us chain together things we want to do with the dataset in a particular order.
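The envisionR implementation is not shown here, but a centered moving average can be sketched with base R's stats::filter(); note the NA values it leaves at both ends of the series:

```r
# A centered moving-average sketch (envisionR's moving_average() may differ)
moving_average_sketch <- function(x, n = 15) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 2))
}

# With n = 3, interior values are means of three consecutive points;
# the first and last values come back as NA
moving_average_sketch(1:10, n = 3)
```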

# R code

activity_1min <- activity_1min |>
  dplyr::arrange(cage_name) |>
  dplyr::group_by(cage_name) |>
  dplyr::mutate(activity_ma = moving_average(movement_mean_per_cage_cm_s_min, n = 15)) |>
  dplyr::ungroup()

glimpse(activity_1min)
Rows: 219,408
Columns: 15
$ start                               <dttm> 2024-06-05 11:58:00, 2024-06-05 1…
$ start_date_local                    <date> 2024-06-05, 2024-06-05, 2024-06-0…
$ start_time_local                    <time> 11:58:00, 11:59:00, 12:00:00, 12:…
$ study_code                          <chr> "GEN3Bridge", "GEN3Bridge", "GEN3B…
$ aggregation_seconds                 <dbl> 60, 60, 60, 60, 60, 60, 60, 60, 60…
$ group_name                          <chr> "Vehicle", "Vehicle", "Vehicle", "…
$ cage_name                           <chr> "A1", "A1", "A1", "A1", "A1", "A1"…
$ animals_cage_quantity               <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ light_cycle                         <chr> "Light", "Light", "Light", "Light"…
$ movement_mean_per_cage_cm_s_min     <dbl> 2.3566303, 3.8535750, 6.2315492, 2…
$ wheel_occupancy_mean_per_cage_s_min <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ food_occupancy_mean_per_cage_s_min  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ water_occupancy_mean_per_cage_s_min <dbl> 0.000000000, 0.122777778, 0.047222…
$ tzone                               <chr> "US/Central", "US/Central", "US/Ce…
$ activity_ma                         <dbl> NA, NA, NA, NA, NA, NA, NA, 2.3497…

We see that the first 7 values for this moving average are NA. This is a consequence of running a centered moving average: we lose a few observations at the beginning and the end of the dataset.

We can now plot the spaghetti plot for the newly smoothed dataset.

# R code

spaghetti_plot(activity_data = activity_1min,
               metadata = metadata,
               yvar = "activity_ma",
               occupancy_norm = TRUE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Warning: Removed 112 rows containing missing values or values outside the scale range
(`geom_line()`).

The dataset appears somewhat smoother, and we can now see some of the effects of interest (note the spike on June 12 for 16 mg/kg caffeine). Let's see if this helped us in the ribbon plot.

# R code

group_mean_sem_ribbon_plot(activity_data = activity_1min,
                           metadata = metadata,
                           yvar = "activity_ma",
                           occupancy_norm = TRUE) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggokabeito::scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 28 rows containing missing values or values outside the scale range
(`geom_line()`).

We can again zoom into the first dose.

# R code

group_mean_sem_ribbon_plot(activity_data = activity_1min,
                           metadata = metadata,
                           yvar = "activity_ma",
                           occupancy_norm = TRUE,
                           xlim = xlimits_firstdose) +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggokabeito::scale_fill_okabe_ito(order = okabe_order)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Warning: Removed 28 rows containing missing values or values outside the scale range
(`geom_line()`).

We see that the smoothed 1 minute data have given us a much fuller picture of the effects than the 1 hour aggregates.

Summarization and Statistical Inference

Sometimes we wish to do a statistical test on datasets. A natural question is when this should be done. There are a couple of possibilities. One is to do tests for every time point in the dataset and attempt to glean some information that way. This can be a very difficult proposition as the more tests we run, the more false discoveries we make. We have to control for the false discovery rate, which can be problematic if the data weren’t summarized adequately and are noisy. If we run tests on each minute (or even each hour) after the dose, we might have issues in identifying an effect.
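As a toy illustration of false discovery rate control (the p-values below are made up), base R's p.adjust() implements the Benjamini-Hochberg procedure:

```r
# Made-up p-values from many hypothetical per-bin tests
p_values <- c(0.001, 0.01, 0.02, 0.04, 0.30, 0.60)

# Benjamini-Hochberg adjusted p-values control the false discovery rate
p.adjust(p_values, method = "BH")
```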

However, if we summarize by taking the mean for a couple of time bins after the initial dose, we may identify an effect. That is what we will do here. We will use the 1 minute scale data.

A Note on Exploratory vs. Hypothesis-Driven Data Analysis

We are forming hypotheses here based upon what we saw in the plot above. This is an exploratory analysis that gives us some information about hypotheses we might wish to adjudicate more carefully in the future. The exploratory mode is a valid mode of scientific inquiry, but exploratory analyses are sometimes considered hypothesis-generating and not hypothesis-adjudicating.

We formed the hypothesis that the four-hour period would display a statistically significant difference by looking at the plots above. In statistics and data science, this is sometimes called hypothesizing after the results are known, or HARKing for short. So long as we are transparent about it and careful with our interpretation, there is no problem with doing this. However, without a lot of extra effort to control false discoveries, it does not allow us to make statements about an effect with a great degree of certainty. It also means we have to be very careful with how we discuss and write up results.

If we wished to do a hypothesis-driven experiment, we would need to write an analysis plan before running it. However, it is difficult to write such a plan when working with a brand-new data type for which we have no specific expectations. Consequently, this experiment was analyzed using exploratory methods. The results of the hypothesis tests below should be interpreted as evidence of how likely it is that future studies could find similar results.

Summarizing at 4 Hours after 1st Dose

First, we bring in the dose data using the dplyr::left_join() function.

# R code

activity_1min_joined <- activity_1min |>
  dplyr::left_join(annotation_group, by = "cage_name")

Then we make a 4-hour post-dose summary with baseline subtraction. We use a series of tidyverse operations here.

# R code

activity_1min_summarize <- activity_1min_joined |>
  dplyr::filter((start >= dose1_time & start <= (dose1_time + 60 * 60 * 4)) |
                  (start >= dose1_time - (60 * 60 * 24) & start <= dose1_time - (60 * 60 * 20))) |>
  dplyr::group_by(group_name, cage_name, start_date_local) |>
  dplyr::mutate(activity_occupancynorm = movement_mean_per_cage_cm_s_min / animals_cage_quantity) |>
  dplyr::summarize(postdose_0to4hr = mean(activity_occupancynorm, na.rm = TRUE)) |>
  dplyr::ungroup() |> 
  dplyr::group_by(group_name, cage_name) |>
  dplyr::arrange(start_date_local) |>
  dplyr::summarize(baseline_subtract_postdose_0to4hr = postdose_0to4hr[2] - postdose_0to4hr[1]) |>
  dplyr::ungroup()
`summarise()` has grouped output by 'group_name', 'cage_name'. You can override
using the `.groups` argument.
`summarise()` has grouped output by 'group_name'. You can override using the
`.groups` argument.
activity_1min_summarize

There were a lot of steps in that operation, but it amounts to taking the mean activity in the four hours after a dose, then subtracting the activity from the same four hour period from the day before. We now have cage-level numbers for the four hours post-dose that have been adjusted for baseline activity. These are suitable for plotting as what we call a beeswarm plot using the ggbeeswarm package (and labels using the ggrepel package):

# R code

# Getting the ggplot add-on packages
library("ggbeeswarm")
library("ggrepel")

ggplot2::ggplot(data = activity_1min_summarize,
                aes(x = group_name, y = baseline_subtract_postdose_0to4hr,
                     color = group_name, label = cage_name)) +
  ggbeeswarm::geom_beeswarm() +
  ggrepel::geom_label_repel() +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggplot2::theme_bw() +
  ggplot2::theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
                 legend.position = "none") +
  ggplot2::xlab(NULL) +
  ggplot2::ylab("Baseline-Corrected Average Activity:\n4 Hours after Dose 1 (cm/s)")

We can see that there is no overlap between the groups, which means the difference would also be significant by the non-parametric Wilcoxon rank-sum test. We can also do a t-test on these data.

# R code

activity_4hr_post_dose_ttest <- t.test(baseline_subtract_postdose_0to4hr ~ group_name, 
                                     data = activity_1min_summarize)
activity_4hr_post_dose_ttest

    Welch Two Sample t-test

data:  baseline_subtract_postdose_0to4hr by group_name
t = 3.8928, df = 5.9249, p-value = 0.008251
alternative hypothesis: true difference in means between group Caffeine 16 mpk and group Vehicle is not equal to 0
95 percent confidence interval:
 0.3690985 1.6287600
sample estimates:
mean in group Caffeine 16 mpk         mean in group Vehicle 
                     2.232369                      1.233440 

We see that for this specific comparison, 16 mg/kg caffeine significantly changes activity. We can use the same code as above, but change the dose from dose1 to dose2.

# R code

activity_1min_summarize_darkdose <- activity_1min_joined |>
  dplyr::filter((start >= dose2_time & start <= (dose2_time + 60 * 60 * 4)) |
                  (start >= dose2_time - (60 * 60 * 24) & start <= dose2_time - (60 * 60 * 20))) |>
  dplyr::group_by(group_name, cage_name, start_date_local) |>
  dplyr::mutate(activity_occupancynorm = movement_mean_per_cage_cm_s_min / animals_cage_quantity) |>
  dplyr::summarize(postdose_0to4hr = mean(activity_occupancynorm, na.rm = TRUE)) |>
  dplyr::ungroup() |> 
  dplyr::group_by(group_name, cage_name) |>
  dplyr::arrange(start_date_local) |>
  dplyr::summarize(baseline_subtract_postdose_0to4hr = postdose_0to4hr[2] - postdose_0to4hr[1]) |>
  dplyr::ungroup()
`summarise()` has grouped output by 'group_name', 'cage_name'. You can override
using the `.groups` argument.
`summarise()` has grouped output by 'group_name'. You can override using the
`.groups` argument.
ggplot2::ggplot(data = activity_1min_summarize_darkdose,
                aes(x = group_name, y = baseline_subtract_postdose_0to4hr,
                     color = group_name, label = cage_name)) +
  ggbeeswarm::geom_beeswarm() +
  ggrepel::geom_label_repel() +
  ggokabeito::scale_color_okabe_ito(order = okabe_order) +
  ggplot2::theme_bw() +
  ggplot2::theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
                 legend.position = "none") +
  ggplot2::xlab(NULL) +
  ggplot2::ylab("Baseline-Corrected Average Activity:\n4 Hours after Dose 2 (cm/s)")

In this case, we have a similar finding. We can again do a t-test.

# R code

activity_4hr_post_darkdose_ttest <- t.test(baseline_subtract_postdose_0to4hr ~ group_name, 
                                           data = activity_1min_summarize_darkdose)
activity_4hr_post_darkdose_ttest

    Welch Two Sample t-test

data:  baseline_subtract_postdose_0to4hr by group_name
t = 5.5905, df = 5.1871, p-value = 0.002247
alternative hypothesis: true difference in means between group Caffeine 16 mpk and group Vehicle is not equal to 0
95 percent confidence interval:
 0.5085365 1.3572163
sample estimates:
mean in group Caffeine 16 mpk         mean in group Vehicle 
                   0.85664983                   -0.07622655 

We see that for the dark dose, caffeine similarly has an effect.

Appendix: Tips and Tricks with Time

The ways that time is encoded in digital in vivo studies can impact how a study is analyzed. Here, we discuss some methods, issues, and workarounds for time.

Working with UNIX Time Codes and Integer Division

R uses UNIX time codes for its internal time encoding. Under the hood, R uses a class called POSIXct to represent time in an easily interpretable and computable way. POSIXct stands for Portable Operating System Interface calendar time. UNIX time codes are encoded as the number of seconds elapsed since 1970-01-01 00:00:00 UTC.

In R, POSIXct objects may be coerced to their raw UNIX time code in seconds using the as.numeric() function. Further, numeric seconds may be coerced back into POSIXct objects using the lubridate::as_datetime() function.

# R code

# Producing an arbitrary hour
datetime_minute <- as.POSIXct("2024-06-26 06:12:00", tz = "UTC")

# Showing this minute in numeric format
as.numeric(datetime_minute)
[1] 1719382320
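
The reverse coercion can be checked with a quick round trip. This is a minimal sketch assuming lubridate is installed:

```r
# A POSIXct minute and its raw UNIX time code in seconds
datetime_minute <- as.POSIXct("2024-06-26 06:12:00", tz = "UTC")
unix_seconds <- as.numeric(datetime_minute)

# Coercing the numeric seconds back into a POSIXct object
round_trip <- lubridate::as_datetime(unix_seconds)

# The round trip reproduces the original datetime
round_trip == datetime_minute
```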

A simple way to manipulate time codes for JAX Envision systems is to add or subtract a number of seconds from them. Additionally, it is often possible to use modulo division to automatically compute things like the hour interval that a specific minute is in. For example:

# R code

# Getting the hour in which this date and time starts
# There are 3600 seconds in an hour
# Subtracting the remainder of the UNIX time stamp modulo 3600
datetime_inhour <- datetime_minute - as.numeric(datetime_minute) %% 3600

# Showing the hour interval for the datetime
datetime_inhour
[1] "2024-06-26 06:00:00 UTC"

This method can be used to quickly compute intervals over a very large vector of data.
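
As a hedged illustration (the variable names here are arbitrary), the same modulo trick vectorizes cleanly over many timestamps:

```r
# Building a vector of minute-level timestamps spanning several hours
minutes <- as.POSIXct("2024-06-26 06:12:00", tz = "UTC") + (0:5) * (45 * 60)

# Flooring each timestamp to the hour containing it via modulo division
hours <- minutes - as.numeric(minutes) %% 3600

# Each timestamp now maps to the start of its hour interval
format(hours, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```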

Computing Experimental Time Encoding

There are many circumstances in which it is advantageous to compute an experimental time. For example:

  • Longitudinal studies may be aligned by the animals’ age.
  • Studies with multiple runs over time may be aligned by the first light/dark transition after animals are inserted into a cage.

In each of these cases, it is necessary to compute a derived experimental time encoding. This encoding is a separate column that includes the time as seconds, hours, or days after an arbitrary event. envisionR has a function reltime() that does this.

For example, to encode an animal’s age, days after its date of birth works well:

# R code

# Getting a vector of presumed dates of birth
dob <- as.POSIXct(c("2024-01-20", "2024-01-22"))

# Getting a vector of a reference date
today <- as.POSIXct(c("2024-06-24", "2024-06-24"))

# Computing days postnatal
days_postnatal <- reltime(rawtimes = today,
                          reftimes = dob,
                          units = "days")

# Displaying days postnatal
days_postnatal
[1] 155.9583 153.9583

We can use the same function to find weeks postnatal.

# R code

# Computing weeks postnatal
weeks_postnatal <- reltime(rawtimes = today,
                           reftimes = dob,
                           units = "weeks")

# Displaying weeks postnatal
weeks_postnatal
[1] 22.27976 21.99405

envisionR also includes a function, weeks_postnatal_seq(), that computes the sequence of dates on which an animal reaches each week of age.

# R code

# Computing a weeks postnatal sequence for the first 16 weeks
first_16 <- weeks_postnatal_seq(dob = as.Date("2024-01-20"), 
                                n_weeks = 16)
first_16
 Postnatal Week 0  Postnatal Week 1  Postnatal Week 2  Postnatal Week 3 
     "2024-01-20"      "2024-01-27"      "2024-02-03"      "2024-02-10" 
 Postnatal Week 4  Postnatal Week 5  Postnatal Week 6  Postnatal Week 7 
     "2024-02-17"      "2024-02-24"      "2024-03-02"      "2024-03-09" 
 Postnatal Week 8  Postnatal Week 9 Postnatal Week 10 Postnatal Week 11 
     "2024-03-16"      "2024-03-23"      "2024-03-30"      "2024-04-06" 
Postnatal Week 12 Postnatal Week 13 Postnatal Week 14 Postnatal Week 15 
     "2024-04-13"      "2024-04-20"      "2024-04-27"      "2024-05-04" 
Postnatal Week 16 
     "2024-05-11" 

Computing UTC Offset

It may be necessary to compute a UTC offset for each timestamp to assist in subsequent data analysis. Base R has no function that returns this offset directly as a number, and the functionality also appears to be absent from many of the date and time packages in R. envisionR has a function to do this.

# R code


# Making arbitrary timestamps that have different UTC offsets
ts <- as.POSIXct(c("2024-02-28 00:00:00",
                   "2024-03-28 00:00:00"),
                 tz = "US/Eastern")

# Displaying offsets
get_utc_offset(ts, as_numeric = TRUE)
[1] -5 -4
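
For the curious, a minimal base-R sketch of the same idea (not envisionR's actual implementation) uses the %z code in format(), which prints the UTC offset as a signed hhmm string:

```r
# Two timestamps straddling the 2024 US spring transition to daylight time
ts <- as.POSIXct(c("2024-02-28 00:00:00", "2024-03-28 00:00:00"),
                 tz = "US/Eastern")

# %z formats the UTC offset as a signed hhmm string, e.g. "-0500"
offset_hhmm <- format(ts, "%z")

# Converting the hhmm strings to numeric hours
# (ignoring the rare offsets with a zero hour and nonzero minutes)
offset_hr <- as.numeric(substr(offset_hhmm, 1, 3))
offset_hours <- offset_hr + sign(offset_hr) * as.numeric(substr(offset_hhmm, 4, 5)) / 60

offset_hours
```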

Daylight Time and 24 Hour Data

For many of the variables of interest, we may assume a 24-hour clock. However, in most of North America, there are two times of year when the 24-hour clock is disrupted: the beginning of daylight time in spring, when one day is 23 hours long, and the end of daylight time in fall, when one day is 25 hours long. For analyses that rely upon predictable periodicity, such as time series decomposition, these disruptions may add an unwanted confound.

The ideal way to overcome the effect of daylight time is to design experiments so that no daylight time transition is present in the dataset. That means one of two things:

  • Running an experiment that does not include a transition to or from daylight time.
  • Keeping the mouse room and the caging system on either standard or daylight time throughout the year.

It should be noted that lights-on and lights-off are strong biological cues called zeitgebers. Alteration of the duration between zeitgebers by even just an hour for one night can disrupt an animal’s behavior and physiology.

However, in some experiments, it is either difficult or impossible to overcome a daylight time transition. In these cases, the analyst may have to find a solution that imputes a 24-hour clock on a day when it does not exist. Recommendations include the following:

  • For the transition to daylight time, 02:00 will typically be missing from the local time. The internal UNIX time code does not allow an hour to be inserted at this position, so it is recommended to transition the experiment to an experimental time encoding: add an hour to all times after the local transition to daylight time, and take the arithmetic mean of the measures in the two hours surrounding the missing 02:00 on the transition day as the imputed value for that hour.
  • For the transition from daylight time, 02:00 will typically appear twice in the local time. The internal UNIX time code does not allow an hour to be subtracted at this position, so it is recommended to transition the experiment to an experimental time encoding: take the arithmetic mean of the measures in both 02:00 hours on the transition day as the imputed value for that hour.

This method will result in an imputed 24-hour dataset suitable for analyses that assume a 24-hour clock, such as time series decomposition.
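
As a rough sketch of the fall transition case, assuming an hourly summary data frame with hypothetical columns hour_start and activity, the two repeated 02:00 local hours can be collapsed into one imputed hour by averaging:

```r
# Hypothetical hourly summaries around the 2024-11-03 fall-back transition;
# rows 2 and 3 stand in for the two repeated 02:00 local hours
hourly <- data.frame(
  hour_start = as.POSIXct("2024-11-03 05:00:00", tz = "UTC") + (0:3) * 3600,
  activity   = c(3.0, 2.0, 4.0, 5.0)
)

# Flag the pair of UNIX hours that share the same 02:00 local label
repeated <- hourly$hour_start %in% hourly$hour_start[2:3]

# Replace the pair with a single row carrying their arithmetic mean
imputed <- rbind(
  hourly[hourly$hour_start < hourly$hour_start[2], ],
  data.frame(hour_start = hourly$hour_start[2],
             activity   = mean(hourly$activity[repeated])),
  hourly[hourly$hour_start > hourly$hour_start[3], ]
)
imputed$activity
```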

Appendix: Learning Statistics and Using R

A Description of R

This vignette uses the statistics and data analysis programming language R. If you are a complete novice R user, you may consider attending a Carpentries workshop on R.

If you have some experience with R coding, RStudio is recommended for your analysis. RStudio is an integrated development environment (IDE), a computer program that is designed to assist in writing and running R code as well as generating data visualizations. RStudio generally makes the process of data analysis easier and more enjoyable.

The standard set of tools available in an R installation is often called base R. A wide variety of data structures and functions are available in base R. For example, the data.frame() data structure operates much like a highly formalized Excel spreadsheet, the t.test() function can be used to compare the means of two groups, and the plot() function can be used to do basic data visualization.

In addition, R is extensible and can be enhanced with a wide variety of packages/libraries. Two primary resources are frequently used to share R libraries: the Comprehensive R Archive Network (CRAN) and Bioconductor. Posit, the company that develops RStudio, maintains a suite of R packages called tidyverse, which is among the most widely used R packages. The tidyverse makes use of the tidy data system. It includes three packages that have proven highly valuable for analyzing JAX Envision datasets:

  • dplyr, a grammar for data manipulation.
  • ggplot2, a grammar for data visualization.
  • lubridate, a flexible framework for working with date and time information.

This vignette assumes that you will primarily work with tidyverse packages. Posit also maintains a variety of cheatsheets that can be used to accelerate R coding; many of these cheatsheets are helpful for tidyverse coding.

Getting Help within R

To get help on any function within R, type a question mark followed by the function name in the R console. For example, ?t.test will bring up documentation on the usage of the t.test() function, including the arguments available to modify what the function returns.
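
For instance, the ? shorthand and the help() function are equivalent, and ?? searches help text across installed packages when the exact function name is unknown:

```r
# help() returns an object listing matching documentation pages;
# printing it at an interactive console opens the page itself
tt_help <- help("t.test")

# ?t.test is shorthand for the call above, and ??"beeswarm" would
# instead search help text across all installed packages
length(tt_help) >= 1
```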

Appendix: Information about this Vignette

This vignette was originally written in the Quarto framework, a multi-language data science notebook system. The file was then ported to RMarkdown. The R session info is below:

# R code

sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggrepel_0.9.6        ggbeeswarm_0.7.2     ggokabeito_0.1.0    
 [4] envisionR_0.0.0.9000 here_1.0.1           janitor_2.2.1       
 [7] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
[10] dplyr_1.1.4          purrr_1.0.4          readr_2.1.5         
[13] tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.2       
[16] tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] generics_0.1.3    stringi_1.8.7     hms_1.1.3         digest_0.6.37    
 [5] magrittr_2.0.3    evaluate_1.0.3    grid_4.5.0        timechange_0.3.0 
 [9] fastmap_1.2.0     rprojroot_2.0.4   jsonlite_2.0.0    scales_1.3.0     
[13] cli_3.6.4         rlang_1.1.6       crayon_1.5.3      bit64_4.6.0-1    
[17] munsell_0.5.1     withr_3.0.2       yaml_2.3.10       tools_4.5.0      
[21] parallel_4.5.0    tzdb_0.5.0        colorspace_2.1-1  vctrs_0.6.5      
[25] R6_2.6.1          lifecycle_1.0.4   snakecase_0.11.1  htmlwidgets_1.6.4
[29] bit_4.6.0         vipor_0.4.7       vroom_1.6.5       beeswarm_0.4.0   
[33] pkgconfig_2.0.3   pillar_1.10.2     gtable_0.3.6      Rcpp_1.0.14      
[37] glue_1.8.0        xfun_0.52         tidyselect_1.2.1  rstudioapi_0.17.1
[41] knitr_1.50        farver_2.1.2      htmltools_0.5.8.1 labeling_0.4.3   
[45] rmarkdown_2.29    compiler_4.5.0